Live R Coding Session

Author

Jeremy Springman

Affiliation

University of Pennsylvania

Published

July 24, 2024

RStudio Layout

Before using R to illustrate basic programming concepts and data analysis tools, we will get familiar with the RStudio layout.

RStudio contains 4 panes

RStudio has four primary panels that will help you interact with your data. We will use the default layout of these panels.

  • Source panel: Top left
    • Edit files to create ‘scripts’ of code
  • Console panel: Bottom left
    • Accepts code as input
    • Displays output when we run code
  • Environment panel: Top right
    • Everything that R is holding in memory
    • Objects that you create in the console or source panels will appear here
    • You can clear the environment with the broom icon
  • Viewer panel: Bottom right
    • View graphics that you generate
    • Navigate files

Illustration

Let’s use these panels to create and interact with data.

Console:

  • Perform a calculation: type 2 + 2 into the console panel and hit ENTER
  • Create and store an object: type sum = 2 + 2 into the console panel and hit ENTER

Source:

  • Start an R script: Open a new .R file (button in top-left below “File”)
  • Create and store an object: type sum = 2 + 3 into the source panel and hit Ctrl+ENTER

Environment:

  • Confirm that the object sum is stored in our environment
  • Use rm(sum) to clear the object from the environment
  • Clear the environment with the broom icon
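
The console, source, and environment steps above can also be run together as a short script. This is a minimal sketch: ls() and exists() are base R functions that are not part of the steps above and are used here only to check what the environment holds.

```r
# Perform a calculation (the result is printed but not stored)
2 + 2

# Create and store an object (as in the source panel step)
sum = 2 + 3

# List the names of the objects currently held in the environment
ls()

# Remove the object, then check whether it still exists
rm(sum)
exists("sum", inherits = FALSE)
```

Because sum is also the name of a built-in R function, inherits = FALSE restricts the check to objects stored in the current environment.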

Viewer:

  • Navigate through your computer’s files
  • Create a plot in the source panel
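
As a minimal sketch of the plotting step, the script below can be typed into the source panel and run. The object names x_values and y_values are made up for this illustration.

```r
# Create two numeric vectors to plot
x_values = 1:10
y_values = x_values^2

# Draw a scatter plot; in RStudio it is displayed in the bottom-right panel
plot(x_values, y_values, main = "A simple scatter plot")
```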

Review of Basic Programming Concepts

Now that we understand the layout, we are ready to review the concepts covered in Module 2 Week 2.2. These concepts will help us understand what is happening when we create and manipulate data.

Objects: where values are saved in R

“Object” is a generic term for anything that R stores in the environment. This can include anything from an individual number or word, to lists of values, to entire datasets.

Importantly, objects belong to different “classes” depending on the type of values that they store.

  • Numerics are numbers like 5.6 or 42
  • Characters are text or strings like "hello world" and "welcome to R"
  • Factors are a group of characters/strings with a fixed number of unique values
  • Logicals are either TRUE or FALSE

# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)
[1] "numeric"
# Create a character object
my_character = "welcome to R"
# Check the class
class(my_character)
[1] "character"
# Create a logical object
my_logical = FALSE
# Check the class
class(my_logical)
[1] "logical"
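
Factors are described in the list of classes but not created in the examples, so here is a minimal sketch of one. The year labels are example values chosen for illustration, and levels() and table() are base R functions.

```r
# Create a factor from a character vector with repeated values
my_factor = factor(c("Year I", "Year II", "Year I", "Year III"))

# Check the class
class(my_factor)

# List the fixed set of unique values (called "levels")
levels(my_factor)

# Count how many times each level appears
table(my_factor)
```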

R can perform operations on objects.

# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)
[1] "numeric"
# Perform a calculation
my_number + 5
[1] 10.6

The class of an object determines the type of operations you can perform on it. Some operations can only be run on numeric objects (numbers).

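# Create a character object
my_number = "5.6"
# Check the class
class(my_number)
[1] "character"
# Perform a calculation (fails because my_number is a character, not a number)
my_number + 5
Error in my_number + 5 : non-numeric argument to binary operator
round(my_number)
Error in round(my_number) : non-numeric argument to mathematical function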

R contains functions that can convert some objects to different classes.

# Convert character to numeric
my_number = as.numeric("5")
class(my_number)
[1] "numeric"
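# But R is only so smart: "five" cannot be read as a number,
# so R stores NA (a missing value) and warns "NAs introduced by coercion"
my_number = as.numeric("five")
my_number
[1] NA
class(my_number)
[1] "numeric"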
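
Since a failed conversion still produces a numeric object, it can help to check for missing values afterwards. The sketch below assumes a made-up character vector; suppressWarnings() is used only to keep the output quiet.

```r
# Convert several character values at once; suppressWarnings() hides the coercion warning
converted = suppressWarnings(as.numeric(c("5", "five", "6.2")))

# Values that could not be converted are stored as NA
is.na(converted)

# Count how many conversions failed
sum(is.na(converted))
```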

Data Structures

The simplest objects are single values, but most data analysis involves more complicated data structures.

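Vectors

Vectors are a type of data structure that stores multiple values of the same class together. Vectors are created using c() and allow you to perform operations on a series of values. (R also has a separate structure called a list, created with list(), which can hold values of different classes.)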

# Create a numeric vector
numeric_vector = c(6, 11, 13, 31)
# Print the vector
print(numeric_vector)
[1]  6 11 13 31
# Check the class
class(numeric_vector)
[1] "numeric"
# Calculate the mean
mean(numeric_vector)
[1] 15.25
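
To illustrate what performing operations on a series of values means, here is a short sketch that reuses the same values. Arithmetic is applied to each element, while summary functions return a single value.

```r
numeric_vector = c(6, 11, 13, 31)

# Arithmetic is applied to every element
numeric_vector * 2

# Summary functions collapse the vector to a single value
length(numeric_vector)
sum(numeric_vector)
max(numeric_vector)
```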

An important part of working with more complex data structures is called “indexing.” Indexing allows you to extract specific values from a data structure.

# Extract the 2nd element from the vector
numeric_vector[2]
[1] 11
# Extract elements 2-4
numeric_vector[2:4]
[1] 11 13 31
# Extract elements 1-2 using a logical vector (TRUE keeps an element, FALSE drops it)
numeric_vector[c(TRUE, TRUE, FALSE, FALSE)]
[1]  6 11
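
Logical indexing is most useful when the TRUE/FALSE values come from a condition rather than being typed by hand. A minimal sketch, using the same values as above:

```r
numeric_vector = c(6, 11, 13, 31)

# A comparison returns one TRUE/FALSE value per element
numeric_vector > 10

# Use that logical vector to keep only the elements greater than 10
numeric_vector[numeric_vector > 10]

# A negative index drops an element instead of extracting it
numeric_vector[-1]
```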

Data Frames

Data frames are the most common type of data structure used in research. A data frame combines multiple vectors of equal length into a single object, with each vector stored as a column.

# Create a dataframe
my_data = data.frame(
  x1 = rnorm(100, mean = 1, sd = 1),
  x2 = rnorm(100, mean = 1, sd = 1)
)

class(my_data)
[1] "data.frame"
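
Indexing also works on data frames. The sketch below rebuilds the same kind of data frame and shows a few common ways to inspect it. The call to set.seed(123) is an addition so that the random values are the same on every run.

```r
# Fix the random seed so the simulated values are reproducible
set.seed(123)
my_data = data.frame(
  x1 = rnorm(100, mean = 1, sd = 1),
  x2 = rnorm(100, mean = 1, sd = 1)
)

# Number of rows (observations) and columns (variables)
dim(my_data)

# Show the first six rows
head(my_data)

# Extract one column as a vector with $ and summarize it
mean(my_data$x1)

# Extract rows 1-3 of column x2 with [row, column] indexing
my_data[1:3, "x2"]
```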

Anything that comes in a spreadsheet (for example, an Excel file) can be loaded into an R environment as a data frame. R works most easily when spreadsheets are saved as .csv files.
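
As a minimal sketch of loading a .csv file, the example below first writes a small made-up data frame to a temporary file and then reads it back with read.csv(). The file and its contents are invented for illustration. With real data, you would pass the path to your own file instead.

```r
# Write a small example data frame to a temporary .csv file
example_path = tempfile(fileext = ".csv")
example_data = data.frame(id = 1:3, score = c(2.5, 4.0, 3.5))
write.csv(example_data, example_path, row.names = FALSE)

# Load the .csv file into the environment as a data frame
loaded_data = read.csv(example_path)
class(loaded_data)
loaded_data
```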

In most data frames, rows correspond to observations and columns correspond to variables that describe the observations. Here, we are looking at survey data from an RCT involving university students in Addis Ababa. Each row corresponds to a different survey respondent, and each column represents their answers to a different question from the survey.

Loading Packages

Packages are an extremely important part of data analysis with R.

  • R gives you access to thousands of “packages” that are created by users
  • Packages contain bundles of code called “functions” that can execute specific tasks
  • Use install.packages() to install a package and library() to load a package

In the next section, we’ll use the package dplyr to perform some data cleaning. dplyr is part of a collection of packages called the tidyverse. Since it is one of the most important packages in the R ecosystem, let’s install and load it.
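
A minimal sketch of installing and loading dplyr follows. The requireNamespace() check and the repos argument are additions: they skip the installation when the package is already present and name a CRAN mirror explicitly. Installation needs an internet connection.

```r
# Install once per computer; skipped if dplyr is already installed
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}

# Load the package at the start of every R session
library(dplyr)
```

Installing is a one-time step, but library() must be called again each time you start a new R session.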

Cleaning Data

In the real world, data rarely comes ready to be analyzed. Data cleaning is the process of manipulating data so that it can be analyzed. This is usually the most difficult and time-consuming part of any data analysis project. Let’s walk through some examples.

Creating Variables

Imagine we want to analyze the relationship between whether a respondent moved to Addis Ababa to attend university and their level of political participation. However, there are two problems:

  • We don’t have a specific variable that measures whether or not respondents moved
  • We have many measures of participation

How can we create a variable measuring whether the respondent moved to Addis Ababa? We have a multiple-choice question asking students about what region they come from.

  q13_4_1 q13_5_1
1       0       0
2       3       0
3       0       0
4       0       0
5       0       2

Cleaning strings

Regression


  Year I  Year II Year III 
     327      277      221 

  0   1 
327 221 

Continuous-ish

Cross-Sectional

Appendix

Averaged Z-Scores
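
When there are many measures of participation, one common way to combine them is to standardize each measure and then average the standardized values for each respondent. The sketch below shows that idea on simulated data. All column names and values are hypothetical, and it is not necessarily how the index in this analysis was built.

```r
# Fix the random seed so the simulated values are reproducible
set.seed(123)

# Hypothetical participation measures on different scales (invented for illustration)
participation = data.frame(
  voted = rbinom(50, size = 1, prob = 0.5),
  protests_attended = rpois(50, lambda = 2),
  meetings_attended = rpois(50, lambda = 4)
)

# Standardize each measure: subtract its mean and divide by its standard deviation
z_scores = scale(participation)

# Average the standardized measures within each respondent
participation$participation_index = rowMeans(z_scores)
summary(participation$participation_index)
```

Standardizing first puts every measure on the same scale, so no single measure dominates the average.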

More complex


  0   1 
327 221 
                Bivariate          Multivariate       Interaction
(Intercept)     -0.150*** (0.036)  -0.114* (0.046)    -0.184** (0.060)
moved            0.234*** (0.045)   0.232*** (0.045)   0.334*** (0.072)
year2                              -0.060 (0.049)      0.077 (0.092)
year3                              -0.053 (0.054)      0.038 (0.085)
moved × year2                                         -0.194+ (0.109)
moved × year3                                         -0.142 (0.110)
Num.Obs.         809                809                809
R2 Adj.          0.032              0.031              0.033
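
Models with the three column labels in the table could be specified along the following lines. This is a sketch on simulated data: the outcome name participation, the simulated values, and the use of lm() are assumptions, so the estimates will not match the table.

```r
# Fix the random seed so the simulated values are reproducible
set.seed(123)
n = 300

# Simulated data: all variable names and values are hypothetical
sim = data.frame(
  moved = rbinom(n, size = 1, prob = 0.5),
  year = factor(sample(1:3, n, replace = TRUE))
)
sim$participation = 0.2 * sim$moved + rnorm(n)

# Bivariate: outcome regressed on moved only
m_bivariate = lm(participation ~ moved, data = sim)

# Multivariate: add year of study as a control
m_multivariate = lm(participation ~ moved + year, data = sim)

# Interaction: allow the coefficient on moved to differ by year
m_interaction = lm(participation ~ moved * year, data = sim)
names(coef(m_interaction))
```

Because year is a factor, R estimates separate coefficients for years 2 and 3 relative to year 1, and moved * year adds the interaction terms.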

Differences over time